ETC5521 Diving Deeper into Data Exploration: Assignment 2
As per Monash’s integrity rules, these solutions are not to be shared beyond this class.
Author
Prof. Di Cook
Published
August 12, 2024
🎯 Goal
The assignment is designed around exploring single variables and pairs of variables, and conducting appropriate tests for the significance of patterns. The assignment represents 20% of your final grade for ETC5521. This is an individual assignment.
📌 Guidelines
Accept the GitHub Classroom Assignment provided in Moodle using a GitHub Classroom compatible web browser. This should generate a private GitHub repository that can be found at https://github.com/etc5521-2024. Your GitHub assignment 2 repo should contain the files assign02.html, README.md, assign02-submission.qmd, assignment.css, etc5521-assignment2.Rproj, .gitignore, and any data files generated from your work that are needed to render your solution file. Code should be available in the report, but folded.
Answer each question in the assign02-submission.qmd in the repo.
For the final submission, render assign02-submission.qmd, which will contain your answers. Make sure to provide the link to the script of any Generative AI conversation you employed in arriving at your solution. Note that marks are allocated for the overall grammar and structure of your final report.
Leave all of your files in your GitHub repo for marking. We will check your git commit history. You should have contributions to the repo with consistent commits over time. (Note: nothing needs to be submitted to Moodle.)
You are expected to develop your solutions by yourself, without discussing any details with other class members or other friends or contacts. You can ask for clarifications from the teaching team, and we encourage you to attend consultations to get assistance as needed. As a Monash student you are expected to adhere to Monash's academic integrity policy, and the details on use of Generative AI as detailed on this unit's Moodle assessment overview.
Deadlines:
| Due date | Turn in |
|---|---|
| 11:45pm Mon Aug 19 | Assignment 2 repo on GitHub has been created |
| 11:45pm Mon Aug 26 | Final solutions available on repo |
🛠️ Exercises
Question 1: Becoming Tukey
ABC News ran a story on Mon 5 Aug 2024 titled "Researchers argue for new, fairer Olympic rankings", wishing for a fairer way to measure Olympic success. The official way to measure success is to rank countries by the number of gold medals. (Note that, when the USA was the top country based on total medals, most US sources ranked countries by total count rather than by gold, but this is a side story. Several weeks on, the USA is also leading on gold medals.)
However, by this system, the top-ranked nations rarely change. The question is posed whether there are fairer ways to measure success. The article discusses using medals per capita, and a new measure described in the motivating Journal of Sports Analytics called the “U Index”.
Your job is to:
Download the latest Olympic Medal tallies (should be the final tallies by the time the assignment is due to be submitted). There are various ways to get this data, in particular you are encouraged to script it with web scraping using rvest. Report the top 10 countries by gold medal count.
Compile the current population for each country, and compute a per capita medal count, and report for the top 10 countries.
Invent five other ways to calculate and rank the Olympic performance of competing countries using this data, possibly augmented with new data. Explain what you think your calculation is measuring about performance.
Solution
The best place to get the data is from Kaggle, but at the time of writing the assignment it was only available on dynamically updated websites, like Wikipedia. The code below can be used to scrape the data. Generally, when you do this, you want to save a local copy of the data, because the website might change and the code may not work for long.
```r
# Using wikipedia as the medal tally source
# The downside of this is that it is missing the zeros
# This data needs to be saved because the web site is always changing
pg <- read_html("https://en.wikipedia.org/wiki/2024_Summer_Olympics_medal_table")
tbls <- pg |> html_elements("table")
medals <- tbls[4] |> html_table()
medals <- medals[[1]]
# By number of gold
medals |> slice_head(n = 10)
medals <- medals |>
  filter(!str_detect(NOC, "Totals")) |>
  mutate(NOC = str_remove(NOC, "‡")) |>
  mutate(NOC = str_remove(NOC, "\\*")) |>
  mutate(Total = as.numeric(Total)) |>
  mutate(p_gold = Gold/Total)
# Note, elected to keep individual athletes, and refugee medal winners
write_csv(medals, file = "data/medals.csv")
```
```r
# By number of gold
medals <- read_csv(here("assignments/data/medals.csv"))
medals |>
  select(NOC, Gold) |>
  slice_head(n = 10)
```
# A tibble: 10 × 2
NOC Gold
<chr> <dbl>
1 United States 40
2 China 40
3 Japan 20
4 Australia 18
5 France 16
6 Netherlands 15
7 Great Britain 14
8 South Korea 13
9 Italy 12
10 Germany 12
Population data can also be scraped from Wikipedia, with the code below.
```r
# Best to save this data also
pop_tbl <- read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population") |>
  html_elements("table")
pop_tbl <- pop_tbl[[1]]
pop <- pop_tbl |> html_table()
pop <- pop |>
  select(Location, Population, `% ofworld`) |>
  filter(!is.na(Population)) |>
  mutate(Population = str_remove_all(Population, ","),
         `% ofworld` = str_remove(`% ofworld`, "%")) |>
  mutate(Population = as.numeric(Population),
         `% ofworld` = as.numeric(`% ofworld`))
# Need to recode some countries, based on the
# checks made in the next few lines
pop <- pop |>
  mutate(Location = fct_recode(Location,
                               "Great Britain" = "United Kingdom",
                               "Hong Kong" = "Hong Kong (China)",
                               "Chinese Taipei" = "Taiwan",
                               "Puerto Rico" = "Puerto Rico (US)"))
write_csv(pop, file = "data/pop.csv")
```
```r
pop <- read_csv(here("assignments/data/pop.csv"))
# Check country matching
medals_pop <- medals |>
  filter(!(NOC %in% c("Totals (80 entries)",
                      "Individual Neutral Athletes[A]",
                      "Refugee Olympic Team",
                      "Individual Neutral Athletes[A][B]"))) |>
  mutate(NOC = str_remove(NOC, "\\*")) |>
  left_join(pop, by = c("NOC" = "Location"))
# medals_pop |>
#   filter(is.na(Population)) |>
#   pull(NOC)
# All matched after name changes
medals_pop <- medals_pop |>
  mutate(percap = Gold/Population*1000000)
medals_pop |>
  arrange(desc(percap)) |>
  select(NOC, percap) |>
  slice_head(n = 10)
```
There are many possible choices of metrics to create, including these from the submissions:
proportion of gold to total medals
use a country's GDP or a socioeconomic index instead of population
total medals, so that countries without top-level performances still feature
a point system, which gives a weighted rank
medals per number of athletes, so teams with a small number of high-performing athletes rise to the top
medals per number of athletes, calibrated by the number of events each athlete participated in
medals per number of athletes, calibrated by the number of team members
medals per GDP
improvement over the 2020 Olympics: Total 2024 - Total 2020. Note this should be divided by Total 2020 to assess the relative increase; any country that scores a lot of medals would likely score more if only the raw difference were used.
gender balance: (male total - female total) / total
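As one concrete example, the point-system metric from the list above could be computed as follows. This is a sketch on an invented tally; the 3/2/1 weighting is an assumption, not an official scheme.

```r
library(dplyr)

# Toy medal tally (illustrative values, not the real 2024 results)
medals_toy <- tibble::tibble(
  NOC    = c("A", "B", "C"),
  Gold   = c(10, 5, 2),
  Silver = c(4, 8, 2),
  Bronze = c(1, 3, 9)
)

# Weighted point system: gold counts more than silver, silver more
# than bronze. The 3/2/1 weights are one common choice, not a standard.
medals_points <- medals_toy |>
  mutate(points = 3 * Gold + 2 * Silver + 1 * Bronze) |>
  arrange(desc(points))
```

Changing the weights changes the ranking, so any write-up should justify the weight choice.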
```r
# By proportion of gold
medals |>
  arrange(desc(p_gold)) |>
  select(NOC, p_gold) |>
  slice_head(n = 10)
```
Here, each line corresponds to a country. We can see that:
The country ranking from gold medal count is very similar to that from method 4 (weighted medal count). A few countries drop dramatically in rank, and a few rise.
The rankings from methods 2 (per capita) and 3 (proportion of gold) are quite different, both from each other and from the other two.
MARKING:
1 pt for (a), including an explanation of how the medal data was obtained.
1 pt for (b), including an explanation of how the population data was obtained.
2.5 pts for the five metrics in (c) (0.5 pts each).
1.5 pts for explaining how the metrics are calculated and what they measure in (c).
Question 2: Chatfield-style IDA
Download the latest data from the World Development Indicators. Use the records from 2004-2022, and all the countries. Select just the variables that have the keyword "fuel" in the explanation.
Your task is to prepare the data, by doing the initial data analysis, needed to answer the question
Is access to clean fuels for cooking related to fuel imports or exports?
You are NOT expected to answer the question. The IDA will likely involve steps such as:
filtering countries with few values
removing variables with few values
imputing missings
screening data
Your script needs to completely document all of the IDA conducted, with comments describing the code and decisions made. The raw data and valid data sets need to be added to your repo.
In your report, provide an outline of the IDA conducted, and a few summary statistics and plots of your “valid” data set.
Solution
The first step is to read the data and tidy it. This is also a good place to save the tidied data and other data products, such as the data dictionary (explanations of the variables) and the full country names with their codes.
```r
# The n_max removes summary text from bottom of spreadsheet
wdi <- read_xlsx(here::here("data/P_Data_Extract_From_World_Development_Indicators.xlsx"),
                 n_max = 4788)
wdi_cnt <- wdi |>
  select(`Country Name`, `Country Code`) |>
  distinct()
wdi_vars <- wdi |>
  select(`Series Name`, `Series Code`) |>
  distinct()
# Note 2023 is missing
wdi_tidy <- wdi |>
  select(`Country Code`, `Series Code`, `2004 [YR2004]`:`2022 [YR2022]`) |>
  rename_all(janitor::make_clean_names) |>
  pivot_longer(x2004_yr2004:x2022_yr2022,
               names_to = "year", values_to = "value") |>
  mutate(year = as.numeric(str_sub(year, 2, 5)),
         value = as.numeric(value)) |>
  pivot_wider(names_from = series_code,
              values_from = value)
save(wdi_tidy, file = "data/wdi_tidy.rda")
save(wdi_cnt, file = "data/wdi_cnt.rda")
save(wdi_vars, file = "data/wdi_vars.rda")
```
The next step is to check the missingness structure. The process is to remove variables and countries while retaining as much data as possible: remove variables that are mostly missing, then countries that are mostly missing, and iterate until a reasonably large set remains for which the missing values can be imputed well.
Here we start with examining missing value summaries.
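A minimal sketch of such summaries on a toy long-form data set follows. The column names mirror the long form of `wdi_tidy` and the values are made up, so treat this as illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy long-form data with missings (illustrative, not real WDI values)
wdi_toy <- tibble::tibble(
  country_code = rep(c("AUS", "FJI", "NZL"), each = 4),
  series       = rep(c("s1", "s2"), times = 6),
  value        = c(1, NA, 2, NA,  NA, NA, NA, 5,  3, 4, NA, 6)
)

# Percent missing by variable
miss_by_series <- wdi_toy |>
  group_by(series) |>
  summarise(pct_missing = 100 * mean(is.na(value)))

# Percent missing by country
miss_by_country <- wdi_toy |>
  group_by(country_code) |>
  summarise(pct_missing = 100 * mean(is.na(value)))
```

Sorting these summaries identifies the variables and countries to drop first.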
We can see that some countries have 100% clean fuels available to all.
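Once the mostly-missing variables and countries have been dropped, the remaining gaps need imputing. The following hand-rolled function is a sketch of the idea behind moving-average imputation (in practice you would use the packaged `imputeTS::na_ma()` rather than this illustrative version):

```r
# Hand-rolled moving-average imputation: each NA is replaced by the
# mean of the non-missing values in a window of +/- k positions.
# Note: NAs filled earlier in the loop feed into later windows.
impute_ma <- function(x, k = 2) {
  n <- length(x)
  for (i in which(is.na(x))) {
    window <- x[max(1, i - k):min(n, i + k)]
    x[i] <- mean(window, na.rm = TRUE)
  }
  x
}
```

This is reasonable for slowly changing annual indicator series, where a nearby-years average is a plausible fill-in.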
```r
# Might also need to filter out countries with constant values
# over time on these variables
wdi_clean_const <- wdi_clean |>
  group_by(country_code, series) |>
  summarise(value = sd(value)) |>
  ungroup()
drop <- wdi_clean_const |>
  filter(value < 0.001) |>
  select(country_code) |>
  distinct() |>
  pull(country_code)
wdi_valid <- wdi_clean |>
  filter(!(country_code %in% drop))
```
Save the final valid data set to a file, for later analysis.
```r
write_csv(wdi_valid, "data/wdi_valid.csv")
```
MARKING:
1 pt for having correct data from WDI database
3 pts for missing value summaries (heatmap, %variable, %country)
2 pts for imputing appropriately using na_ma
1 pt for finishing with roughly 150 countries and the five variables
1 pt for summarising the final saved data.
Question 3: Check the data collection
This paper describes an experiment to determine whether a choropleth map or a hexagon tile map is better for displaying different types of statistical measurements spatially, especially focused on Australia. The data collected contains the following variables:
trend: type of pattern simulated in the data, NW-SE, all cities, three cities
location: which plot in the lineup of 20 was the data plot
type: plot type, “Geography” is a choropleth map, “Hexagons” is a hexagon tile map
choice: which plot from the lineup the subject picked
reason: their explanation for the choice
certainty: how confident were they in their choice
time: timestamp to allow examination of time taken
order: each subject saw 12 lineups in this order
replicate: different simulated data generated
id: unique id for each lineup used
detect: calculated from whether the plot chosen matched the data plot
Your job is to check whether the data collected is completely covering the experimental design, or whether there were any problems with the data. Your answer should include information, and summary statistics and/or plots that:
identify which columns of the data match the factors in the experiment.
check the number of measurements collected for each of the treatments.
Solution
Factors:
trend: NW-SE, all cities, three cities
type: Geography, Hexagons
Measured variables:
choice: which plot the subject picked
reason: their explanation for the choice
certainty: how confident were they in their choice
time: timestamp to allow examination of time taken
detect: calculated from whether the plot chosen matched the data plot
The group (A or B) was used for assignment of people to treatments, to make sure that each subject saw either the Geography or the Hexagons plot of any given data set, but not both.
There were three replicates for each trend type.
The location of the data plot was randomised in each lineup. This is not a factor, just a precaution preventing people from expecting the data plot to be in a particular location.
Randomising the order that the plots were shown to subjects was a precaution against an effect occurring based on when the plots were viewed.
The steps to checking the design are:
compare counts by factors
compare counts by treatments
discover the experimental design by pivoting on type or group, to check that the same data was not shown to a subject twice
check that the order in which lineups were seen has no particular patterns
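The first two of these steps can be sketched with `count()` on a toy version of the data. The column names follow the data description above; the values are invented.

```r
library(dplyr)

# Toy experiment records (values illustrative only)
results <- tibble::tibble(
  id    = 1:8,
  trend = rep(c("NW-SE", "all cities", "three cities", "NW-SE"), times = 2),
  type  = rep(c("Geography", "Hexagons"), each = 4)
)

# Counts by each factor separately
results |> count(trend)
results |> count(type)

# Counts by treatment (factor combination): a balanced design has
# roughly equal counts in every cell of this table
cell_counts <- results |> count(trend, type)
```

Any empty or badly unbalanced cell in `cell_counts` flags a problem with the data collection.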
In summary, it’s a very nicely organised experiment. The treatments were seen relatively equally among subjects. The order from one person to another was beautifully varied. The sets A and B ensured that no-one saw the same data twice.
The only small issues were that slightly different numbers of subjects saw sets A and B, and there may have been a small concentration of the hexagon type being seen after the geography type.
MARKING:
2 points for appropriately listing factors, and measured variables
2 points for appropriate numerical tabulation of factors
2 points for the more sophisticated checking which reveals the assignment to subjects, and order lineups were provided to subjects.
Generative AI analysis
In this part, we would like you to actively discuss how generative AI helped with your answers to the assignment questions, and where or how it was mistaken or misleading.
You need to provide a link that makes the full script of your conversation with any generative AI tool accessible to the teaching staff. You should not use a paid service, as the freely available systems will be sufficiently helpful.
Solution
The link to my use of ChatGPT for help on this project is XXX.
Marks
| Part | Points |
|---|---|
| Q1 | 6 |
| Q2 | 8 |
| Q3 | 6 |
| GitHub Repo | -2 |
| Generative AI Analysis | -3 |
| Formatting, Spelling & Grammar | -3 |
Note that the negative marks for "GitHub Repo", "Generative AI Analysis" and "Formatting, Spelling & Grammar" correspond to reductions in scores. You can lose up to 3 marks for poor use of generative AI: for example, no use, only basic questions, no link to the script, or clear use without acknowledgement. You can lose up to 3 marks for poorly formatted and written answers. Two marks will be deducted if you have NOT accepted the assignment and created your own repo by 11:45pm Mon Aug 19.
Solution
Here are some guidelines for code:
Split chunks that process data and ones that plot data
It is dangerous to run scraping code on every render because web sites often change; the Wikipedia page has changed since the assignment was made available. It is always best to scrape once, and save the result.
Label the chunks - helps to locate errors
Comment the code so you can remember what each is supposed to do when you come back to it.
Learn how to name things well, see https://github.com/jennybc/how-to-name-files
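The scrape-once-and-save advice can be wrapped in a small caching helper. This is a sketch; the default path and the `scrape_medals` argument are placeholders for your own scraping code.

```r
library(readr)

# Read from the cached file if it exists; otherwise run the supplied
# scraping function once and save its result for future renders.
get_medals <- function(path = "data/medals.csv", scrape_medals) {
  if (file.exists(path)) {
    read_csv(path, show_col_types = FALSE)
  } else {
    medals <- scrape_medals()
    write_csv(medals, path)
    medals
  }
}
```

With this pattern, deleting the cached file is the explicit way to force a re-scrape, and a render never silently depends on the website still having the same structure.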